ABSTRACT

Reverberation is damaging to both the quality and the intelligibility
of a speech signal. We propose a novel single-channel method of
dereverberation based on a linear filter in the Short Time Fourier
Transform domain. Each enhanced frame is constructed from a linear
sum of nearby frames, with coefficients derived from the channel
impulse response. The results show that, given knowledge of the
impulse response, the method can resolve a reverberant signal to a
non-reverberant signal.

Index Terms — dereverberation, inverse channel filtering, 
speech enhancement 

1. INTRODUCTION 

Speech is inherently non-stationary, so speech processing algorithms
are frequently applied to short frames in which the speech is
quasi-stationary. Furthermore, speech is sparse in the time-frequency
domain, which allows us to distinguish and enhance the speech content
well. The Short Time Fourier Transform (STFT) domain is therefore the
domain of choice for many speech and audio algorithms.

Reverberation arises from multi-path propagation of an acoustic
signal, s[n], through a channel with impulse response h[n] to a
microphone. Reverberation causes speech to sound distant and
spectrally distorted, which reduces intelligibility [1]. The further
the source is from the microphone, the greater the effects of
reverberation. Automatic speech recognition is severely hindered by
reverberation [?]. Beamformers utilise the time difference of arrival
at each sensor in an array to spatially filter a sound field. Because
of multi-path propagation, beamformers fail in reverberant
environments. Channel inversion methods are therefore of high
importance in spatial filtering.

Several dereverberation algorithms already exist in the STFT domain.
For example, spectral subtraction has been used to estimate the power
spectrum of the late reverberation and subtract it from the current
spectrum to leave the direct path [?]; this approach was extended in
[?] to introduce the frequency dependence of the reverberation time.

Other methods of dereverberation utilise knowledge of the system
impulse response, h[n]; however, none exist in the STFT domain. Least
squares has previously been used to create an inverse filter from
knowledge of the impulse response [6]. This was extended to the
multichannel case with the Multiple-input/output INverse Theorem
(MINT) [7], which is capable of finding exact inverse filters through
the use of multiple transmission channels.

We wish to create an algorithm in the STFT domain which utilises
knowledge of the impulse response, h[n], for dereverberation.
However, simply creating an inverse filter in the STFT domain is not
straightforward, as the STFT process is time-variant. We present a
single-channel method of dereverberation based on a linear filter
that combines nearby frames and uses a novel method to account for
the time-varying nature of the STFT domain. The frames are linearly
combined using coefficients computed from the impulse response
through a least-squares method.

The remainder of the paper is organised as follows. In Section 2 the
method is outlined. Section 3 details the process to select the
optimal coefficients for dereverberation. The results of the
algorithm are detailed in Section 4 and conclusions are drawn in
Section 5.

2. STFT-DOMAIN DEREVERBERATION 

The observed reverberant signal, y[n], at the microphone is 
the convolution of the source signal, s[n], and the channel 
impulse response, h[n]: 

$$y[n] = \sum_{m=0}^{M-1} h[m]\, s[n-m]. \qquad (1)$$

Exploiting knowledge of the channel impulse response, we propose a
new method to reduce the effects of reverberation on y[n], to form an
estimate, ŝ[n], of the original signal.

The reverberant signal is transformed into the STFT domain using a
window, w[n], and an overlap factor Q:

$$Y_k[l] = \sum_{n=0}^{QR-1} y[n + lR]\, w[n]\, e^{-j 2\pi k n/(QR)}, \qquad (2)$$

where l represents the frame number, k the frequency bin and R the
frame increment. The enhanced signal is formed through a linear sum
of nearby frames of the reverberant signal:

$$\hat{S}_k[l] = \sum_{r=-A}^{B} G_k[r]\, Y_k[l-r], \qquad (3)$$


where A is the number of future frames and B is the number of past
frames to be used in the enhancement. The resulting frames are then
transformed back into time frames with the inverse Discrete Fourier
Transform (DFT):

$$\hat{s}[l, m] = \frac{1}{QR} \sum_{k=0}^{QR-1} \hat{S}_k[l]\, e^{j 2\pi k m/(QR)}, \qquad (4)$$


[Figure 1: STFT panels of H[l, k] and H_d[l, k]; axes: time and frequency.]

Fig. 1. The above plots show the STFT of both the channel response,
H[l, k], and the desired response, H_d[l, k]. For each frequency bin
the filter linearly combines future and past frames of H[l, k] to
best match H_d[l, k].


which are then overlap-added [8] to form the enhanced time signal:

$$\hat{s}[n] = \sum_{l=l_n-Q+1}^{l_n} \hat{s}[l, n - lR]\, w[n - lR], \qquad (5)$$

where l_n = ⌊n/R⌋. Perfect reconstruction, ŝ[n] = y[n], is obtained
with the coefficients G_k[r] = δ[r] provided that the window used for
analysis and synthesis satisfies:

$$\sum_{q=0}^{Q-1} w^2[qR + n] = 1 \quad \forall\, n \in [0, R-1].$$
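
The analysis, frame filtering and overlap-add steps of (2) to (5) can be sketched numerically. The following is a minimal sketch rather than the authors' implementation: it assumes a square-root periodic Hann window scaled so that the window condition holds, treats frames outside the signal as zero, and uses hypothetical function names (`sqrt_hann`, `stft`, `frame_filter`, `istft`). With G_k[r] = δ[r] the output should match the input on every sample that is covered by Q frames.

```python
import numpy as np

def sqrt_hann(Q, R):
    """Length-QR window with sum_q w^2[qR + n] = 1 (an assumed window choice)."""
    N = Q * R
    hann = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(N) / N))
    return np.sqrt(hann * 2.0 / Q)

def stft(y, w, R):
    """Eq. (2): window each length-QR frame with hop R, then take the DFT."""
    N = len(w)
    n_frames = (len(y) - N) // R + 1
    frames = np.stack([y[l * R:l * R + N] * w for l in range(n_frames)])
    return np.fft.fft(frames, axis=1)              # shape (frames, bins)

def frame_filter(Y, G, A, B):
    """Eq. (3): S_k[l] = sum_{r=-A}^{B} G_k[r] Y_k[l-r]; G has shape (bins, A+B+1)."""
    n_frames = Y.shape[0]
    S = np.zeros_like(Y)
    for r in range(-A, B + 1):
        # Output frames l for which the input frame l - r exists.
        lo, hi = max(0, r), min(n_frames, n_frames + r)
        if lo < hi:
            S[lo:hi] += G[:, r + A] * Y[lo - r:hi - r]
    return S

def istft(S, w, R):
    """Eqs. (4)-(5): inverse DFT, synthesis window, overlap-add."""
    n_frames, N = S.shape
    frames = np.fft.ifft(S, axis=1).real * w
    out = np.zeros((n_frames - 1) * R + N)
    for l in range(n_frames):
        out[l * R:l * R + N] += frames[l]
    return out

Q, R, A, B = 4, 64, 9, 9
w = sqrt_hann(Q, R)
y = np.random.default_rng(0).standard_normal(4096)

G = np.zeros((Q * R, A + B + 1))
G[:, A] = 1.0                                      # G_k[r] = delta[r]

Y = stft(y, w, R)
s_hat = istft(frame_filter(Y, G, A, B), w, R)

# Samples near the ends are covered by fewer than Q frames, so compare the interior only.
interior = slice((Q - 1) * R, len(s_hat) - (Q - 1) * R)
print(np.max(np.abs(s_hat[interior] - y[interior])))
```

The same window is applied at analysis and synthesis, which is why the condition is on the sum of squared window samples rather than on the window itself.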

3. OPTIMAL COEFFICIENTS 

Assuming that h[n] is known, our goal is to determine the filter
coefficients G_k = [G_k[−A], …, G_k[B]] so that

$$\hat{s}[n] \approx s[n].$$

Consider the response of the system when the input signal is an
impulse at sample λ:

$$s[n] = \delta[n - \lambda], \qquad \lambda \in [0, R-1].$$

When processing in the STFT domain, the earliest output frame that is
affected by the impulse occurs at l_min = 1 − Q − A, whereas the
latest affected frame, l_max, lies beyond 1 + B by a number of frames
determined by the impulse response length M. Applying the process
from Section 2, we can find a relationship between the STFT of the
channel impulse response, H^(λ)[l, k], and the desired impulse
response, H_d^(λ)[l, k], which is the STFT of the direct path impulse
response, when there are no reflections present.

We determine G_k to minimise the difference between the two. So for
each frequency bin, k, we have an overdetermined set of equations:

$$\hat{H}^{(\lambda)}[l, k; G_k] = \sum_{r=-A}^{B} G_k[r]\, H^{(\lambda)}[l - r, k] \approx H_d^{(\lambda)}[l, k], \qquad (6)$$

for each λ ∈ [0 : R − 1] and l ∈ [l_min : l_max]. This gives us
(2 + A + B + Q)R + M − 1 equations, with A + B + 1 unknowns. This
process is shown in Fig. 1. We combine B past frames with A future
frames to best approximate the current frame from the desired impulse
response.

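We solve these equations using linear least squares [9] to find:

$$G_k = \arg\min_{G_k} \sum_{\lambda=0}^{R-1} \sum_{l=l_{\min}}^{l_{\max}} \left| \hat{H}^{(\lambda)}[l, k; G_k] - H_d^{(\lambda)}[l, k] \right|^2. \qquad (7)$$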
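
The per-bin least-squares problem can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the channel and desired STFTs are assumed to be stored as arrays of shape (shifts, frames, bins), frames outside the stored range are treated as zero, and the names `design_matrix` and `solve_coefficients` are hypothetical. The check at the end uses synthetic data generated from known coefficients, so that an exact fit exists and should be recovered.

```python
import numpy as np

def design_matrix(Hk, A, B):
    """Stack the equations of one frequency bin.

    Hk has shape (shifts, frames). Column r + A of the result holds
    H[l - r] for every shift and frame l, for r = -A .. B.
    """
    n_frames = Hk.shape[1]
    cols = []
    for r in range(-A, B + 1):
        shifted = np.zeros_like(Hk)
        if r >= 0:
            shifted[:, r:] = Hk[:, :n_frames - r]   # past frames
        else:
            shifted[:, :r] = Hk[:, -r:]             # future frames
        cols.append(shifted.reshape(-1))
    return np.stack(cols, axis=1)

def solve_coefficients(H, Hd, A, B):
    """Solve the least-squares problem independently for each bin k."""
    n_bins = H.shape[2]
    G = np.zeros((n_bins, A + B + 1), dtype=complex)
    for k in range(n_bins):
        X = design_matrix(H[:, :, k], A, B)
        target = Hd[:, :, k].reshape(-1)
        G[k] = np.linalg.lstsq(X, target, rcond=None)[0]
    return G

# Synthetic check: build a "desired" response from known coefficients.
rng = np.random.default_rng(1)
n_shifts, n_frames, n_bins, A, B = 4, 30, 5, 2, 3
shape = (n_shifts, n_frames, n_bins)
H = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
G_true = (rng.standard_normal((n_bins, A + B + 1))
          + 1j * rng.standard_normal((n_bins, A + B + 1)))
Hd = np.stack(
    [(design_matrix(H[:, :, k], A, B) @ G_true[k]).reshape(n_shifts, n_frames)
     for k in range(n_bins)],
    axis=2,
)
G = solve_coefficients(H, Hd, A, B)
print(np.max(np.abs(G - G_true)))
```

Because each bin is solved separately, the unknown vector has only A + B + 1 entries per problem, which keeps each solve small even when many shifts and frames are stacked.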


The overall impulse response of the computed channel is time-variant,
but we can determine an average channel response as the inverse STFT
of:

$$\hat{H}[l, k] = \frac{1}{R} \sum_{\lambda=0}^{R-1} \hat{H}^{(\lambda)}[l, k; G_k] \exp\!\left( \frac{j 2\pi k \lambda}{QR} \right), \qquad (8)$$

where a phase shift is applied to correspond with the sample position
within the frame.

3.1. Time domain error bound 

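The above minimisation problem minimises the reverberation present in
the enhanced signal. Let us define the error in the impulse responses
in both the time domain and the STFT domain as:

$$h_e[n] = \hat{h}[n] - h_d[n].$$

The error in a single frame in the STFT domain is as follows:

$$H_e^{(\lambda)}[l, k] = \hat{H}^{(\lambda)}[l, k; G_k] - H_d^{(\lambda)}[l, k].$$

The total power of the error in the STFT domain across all frames,
frequencies and shifts is denoted:

$$E_H = \frac{1}{QR} \sum_{k=0}^{QR-1} \sum_{\lambda=0}^{R-1} \sum_{l=l_{\min}}^{l_{\max}} \left| H_e^{(\lambda)}[l, k] \right|^2.$$

Using Parseval's theorem, the power of the error in the time domain
is given as:

$$\sum_{m=0}^{QR-1} \sum_{\lambda=0}^{R-1} \sum_{l=l_{\min}}^{l_{\max}} \left| h_e^{(\lambda)}[m, l] \right|^2 = \frac{1}{QR} \sum_{k=0}^{QR-1} \sum_{\lambda=0}^{R-1} \sum_{l=l_{\min}}^{l_{\max}} \left| H_e^{(\lambda)}[l, k] \right|^2, \qquad (9)$$

where h_e^(λ)[m, l] is sample m of the inverse DFT of H_e^(λ)[l, k]
in frame l.
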
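Alternatively we can express the error, in the time domain, as the
weighted sum of the frames with the window function:

$$h_e[lR + n] = \sum_{q=0}^{Q-1} w[qR + n]\, h_e[qR + n, l - q],$$

where h_e[m, l] denotes sample m of the time-domain error in frame l.
We sum over all time samples to give the total error:

$$\sum_{l=0}^{N-1} \sum_{n=0}^{R-1} \left| h_e[lR + n] \right|^2 = \sum_{l=0}^{N-1} \sum_{n=0}^{R-1} \left( \sum_{q=0}^{Q-1} w[qR + n]\, h_e[qR + n, l - q] \right)^2. \qquad (10)$$

Thus, applying the Cauchy-Schwarz inequality to (9) and (10), we can
show that the error in the STFT domain is an upper bound for the time
domain error:

$$\sum_{l=0}^{N-1} \sum_{n=0}^{R-1} \left( \sum_{q=0}^{Q-1} w[qR + n]\, h_e[qR + n, l - q] \right)^2 \le \frac{1}{QR} \sum_{k=0}^{QR-1} \sum_{\lambda=0}^{R-1} \sum_{l=0}^{N-1} \left| H_e^{(\lambda)}[l, k] \right|^2.$$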


Therefore solving the related problem in the STFT domain 
places an upper bound on the amount of reverberation in our 
output signal. 
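
The bound can be checked numerically for a single set of frame errors. The sketch below is an illustration under assumptions, not part of the original derivation: it draws random per-frame time-domain errors, uses a square-root periodic Hann window that satisfies the window condition, and compares the power after overlap-add with the summed frame power and with the Parseval-normalised STFT power. The function name `error_powers` is hypothetical.

```python
import numpy as np

def error_powers(frames, w, R):
    """Return (time, frame, STFT) error powers for per-frame errors.

    frames[l, m] is sample m of the time-domain error in frame l.
    """
    n_frames, N = frames.shape
    h_e = np.zeros((n_frames - 1) * R + N)
    for l in range(n_frames):
        h_e[l * R:l * R + N] += w * frames[l]      # windowed overlap-add
    H_e = np.fft.fft(frames, axis=1)               # per-frame DFT of the error
    time_power = np.sum(h_e ** 2)
    frame_power = np.sum(frames ** 2)
    stft_power = np.sum(np.abs(H_e) ** 2) / N      # Parseval normalisation 1/(QR)
    return time_power, frame_power, stft_power

Q, R = 4, 64
N = Q * R
# Square-root periodic Hann, scaled so that sum_q w^2[qR + n] = 1.
w = np.sqrt((1.0 - np.cos(2.0 * np.pi * np.arange(N) / N)) / Q)
frames = np.random.default_rng(2).standard_normal((50, N))

time_power, frame_power, stft_power = error_powers(frames, w, R)
print(time_power, frame_power, stft_power)
```

The Cauchy-Schwarz step bounds each output sample by the sum of squared frame samples that overlap it, weighted by the squared window samples, which sum to at most one.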


4. EVALUATION 


To evaluate the reduction in reverberation, we use two metrics: the
Direct-to-Reverberant Ratio (DRR) [10] and the Signal-to-Reverberation
Ratio (SRR) [11]. To evaluate the perceptual quality of the enhanced
signals, Perceptual Evaluation Of Speech Quality (PESQ) [12] is used.
The DRR [dB] is defined in terms of E_d, the direct path energy. The
direct path in the impulse response may occur in between samples, in
which case the path energy is spread across the nearby samples by a
sinc function. Thus the direct path energy is computed using a
convolution with a sinc function with a varying offset until a
maximum is found:


$$E_d = \max_{\alpha} \left( \sum_{n=-\eta}^{\eta} \frac{\sin(\pi (n + \alpha))}{\pi (n + \alpha)}\, h_x[n + n_d] \right)^2,$$

where h_x[n] is the impulse response under evaluation, n_d is the
nearest index of the direct path in the impulse response, η = 8 is
the number of sidelobes of the sinc function to use in the summation
and α ∈ [−1, 1] is the offset that finds the maximum power.
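
The offset search for the direct path energy can be sketched as below. This is a minimal sketch under assumptions: the offset is searched on a uniform grid (the grid size `n_offsets` is a hypothetical parameter), the impulse response is zero-padded so the summation never leaves its range, and the test response is a single ideal path at a fractional delay. The function name `direct_path_energy` is hypothetical.

```python
import numpy as np

def direct_path_energy(h, n_d, eta=8, n_offsets=201):
    """Estimate direct path energy by sinc interpolation around index n_d."""
    h_pad = np.pad(np.asarray(h, dtype=float), eta)   # zeros either side
    n = np.arange(-eta, eta + 1)
    segment = h_pad[n + n_d + eta]                    # h[n + n_d]
    best = 0.0
    for alpha in np.linspace(-1.0, 1.0, n_offsets):
        # np.sinc(x) is sin(pi x) / (pi x), matching the kernel in the text.
        value = np.sum(np.sinc(n + alpha) * segment)
        best = max(best, value ** 2)
    return best

# A single unit-amplitude path arriving 20.3 samples after the origin.
taps = np.arange(64)
h = np.sinc(taps - 20.3)
n_d = int(np.argmax(np.abs(h)))                       # nearest integer index
E_d = direct_path_energy(h, n_d)
print(n_d, h[n_d] ** 2, E_d)
```

Taking the squared peak sample alone underestimates the energy of a path that falls between samples, which is the case the interpolation is meant to handle.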


The SRR [dB] is defined on a frame-by-frame basis and then averaged
across all M frames of the signal. It compares s_d[n], the original
direct path signal, with ŝ[n], the enhanced signal, and so gives a
measure of the reverberation power in relation to the useful direct
path. It is a similar measure to the DRR but uses speech signals
rather than the channel response.

The optimal coefficients from Section 3 were calculated for a Room
Impulse Response (RIR) and the corresponding channel response from
(8) was found. A total of 600 RIRs were used to test the system.
These correspond to a single source and microphone in 40 different
rooms, with 15 different position combinations in each. The impulse
responses were generated using the Room Impulse Response Generator
from [?], which is based on the image method [?]. In all cases
Q = 4, R = 64, A = 9, B = 9.

As both the SRR and PESQ work on speech samples, the TIMIT core test
set [15] was chosen. Each speech sample was convolved with each h[n]
before undergoing enhancement as described in Section 2. The before
and after signals, y[n] and ŝ[n], were then used with the SRR and
PESQ metrics to gauge any improvement.

The performance of the proposed algorithm has been compared to the
time domain inverse filter as proposed by Widrow [?]. The method
designs an inverse filter, g[n], through least squares to best invert
the system response, h[n], where N_h = 1024 in our case.
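
A time domain least-squares inverse filter of this general kind can be sketched as follows. This is a generic illustration, not the exact formulation used for the comparison: the filter length `n_taps` and the target `delay` are hypothetical parameters, and the example channel is a short minimum-phase response chosen so that a causal inverse exists.

```python
import numpy as np

def ls_inverse_filter(h, n_taps, delay=0):
    """Least-squares g such that h convolved with g approximates a delayed impulse."""
    h = np.asarray(h, dtype=float)
    n_out = len(h) + n_taps - 1
    C = np.zeros((n_out, n_taps))
    for j in range(n_taps):
        C[j:j + len(h), j] = h              # column j is h delayed by j samples
    target = np.zeros(n_out)
    target[delay] = 1.0                     # desired overall response
    g, *_ = np.linalg.lstsq(C, target, rcond=None)
    return g

h = np.array([1.0, 0.5])                    # toy minimum-phase channel
g = ls_inverse_filter(h, n_taps=32)
equalised = np.convolve(h, g)               # should be close to a unit impulse
print(np.max(np.abs(equalised[1:])))
```

For channels that are not minimum phase, a non-zero `delay` is usually needed before a finite causal filter gives a good approximation.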


4.1. Results 

The DRR was computed for both h[n] and ĥ[n] across all 600 RIRs. The
results comparing the DRR before and after the algorithm are shown in
Fig. 2. The DRR improved for all the impulse responses tested except
those where the original DRR exceeded 0 dB. The resulting performance
is independent of the amount of reverberation in the initial signal
and hovers close to 6 dB, giving an improvement of up to 34 dB. Thus
the algorithm is able to reduce reverberation to the same level
regardless of how reverberant the original channel is.
Fig. 2. The DRR after the algorithm for 600 RIRs shows a sizeable
improvement over the initial reverberant signal, with a mean
improvement of 1.0 dB.

The averaged SRR for each RIR is shown in Fig. 3. It follows a
similar pattern to the DRR. The enhanced signals hover around 0 dB.
When the original SRR surpassed 0 dB, the algorithm was unable to
make any further improvements, and caused slight degradation to these
non-reverberant signals.

The averaged PESQ results are shown in Fig. 4. The enhancement gave a
small gain in perceptual quality which, whilst it does not show the
removal of reverberation, does show that the algorithm does not
introduce significant distortion. Given the limited improvement in
perceived speech quality, the algorithm is best suited to
applications which require signals without reverberation, rather than
to end-user perceptual improvement.

Samples of the reverberant and processed speech are available on the
internet [?].




[Figure 3: axis label: SRR difference (dB).]

Fig. 3. The speech signals after enhancement show a much improved SRR
compared to the reverberant signals, with a mean improvement of
1.4 dB.

5. CONCLUSIONS


We have described a novel approach to dereverberation using a linear
filter in the STFT domain. Using knowledge of the channel impulse
response we can find an optimal combination of frames to reduce the
effects of reverberation. The algorithm gives clear performance gains
in dereverberation. Both the DRR and the SRR show that, regardless of
the amount of initial reverberation present, the enhanced signal has
a similarly low level of reverberation, without the introduction of
distortion.

We have shown that the proposed STFT domain algorithm is as good as
the time domain inverse filter, allowing us to apply dereverberation
in the more appropriate domain without loss of performance.

We have overcome the time-variance of the STFT by considering all the
possible impulse positions within a single frame.

By working in the STFT domain we can solve each frequency band, k,
independently. Together these give a useful framework that suits many
applications already processing in this domain.


[Figure 4: PESQ Performance.]

Fig. 4. PESQ is shown for 600 different RIRs before and after
enhancement. Each point is the average of 240 utterances for that
RIR, with a mean improvement of 0.08 PESQ.